Part 1: Lightweight Object Detection¶

Loss Curves¶

5 Sample Detections¶

Image: 0.png
  top 5 scores for 0.png: [0.20210139453411102, 0.20152655243873596, 0.19944915175437927, 0.19897283613681793, 0.1947309821844101]
  # boxes before NMS (top_k_pre): 50
  # boxes after custom NMS: 3
  # boxes after tv NMS:      3
Image: 1.png
  top 5 scores for 1.png: [0.20436616241931915, 0.19330492615699768, 0.19257889688014984, 0.1855223923921585, 0.18518216907978058]
  # boxes before NMS (top_k_pre): 50
  # boxes after custom NMS: 3
  # boxes after tv NMS:      3
Image: 10.png
  top 5 scores for 10.png: [0.19172115623950958, 0.17924265563488007, 0.17075741291046143, 0.1706967055797577, 0.16946221888065338]
  # boxes before NMS (top_k_pre): 50
  # boxes after custom NMS: 3
  # boxes after tv NMS:      3
Image: 11.png
  top 5 scores for 11.png: [0.1782817542552948, 0.17269743978977203, 0.1700783520936966, 0.1692238599061966, 0.16636724770069122]
  # boxes before NMS (top_k_pre): 50
  # boxes after custom NMS: 3
  # boxes after tv NMS:      3
Image: 12.png
  top 5 scores for 12.png: [0.20829883217811584, 0.1904813051223755, 0.18734796345233917, 0.17319230735301971, 0.16764065623283386]
  # boxes before NMS (top_k_pre): 50
  # boxes after custom NMS: 3
  # boxes after tv NMS:      3

Personal Images¶

Image: banana1.jpg
  top 5 scores for banana1.jpg: [0.20490090548992157, 0.19734756648540497, 0.19688843190670013, 0.19362686574459076, 0.19353023171424866]
  # boxes before NMS: 50
  # boxes after custom NMS: 3
  # boxes after tv NMS:      3
Image: banana2.jpg
  top 5 scores for banana2.jpg: [0.18971116840839386, 0.1891666054725647, 0.18829527497291565, 0.18690736591815948, 0.18597716093063354]
  # boxes before NMS: 50
  # boxes after custom NMS: 3
  # boxes after tv NMS:      3
Image: banana3.jpg
  top 5 scores for banana3.jpg: [0.24207830429077148, 0.22727727890014648, 0.21902689337730408, 0.21292327344417572, 0.2118563950061798]
  # boxes before NMS: 50
  # boxes after custom NMS: 3
  # boxes after tv NMS:      3
Image: banana4.jpg
  top 5 scores for banana4.jpg: [0.216048464179039, 0.21234364807605743, 0.20538221299648285, 0.20354531705379486, 0.20189881324768066]
  # boxes before NMS: 50
  # boxes after custom NMS: 3
  # boxes after tv NMS:      3
Image: banana5.jpg
  top 5 scores for banana5.jpg: [0.3227527439594269, 0.27928754687309265, 0.26482123136520386, 0.2580227255821228, 0.2434408962726593]
  # boxes before NMS: 50
  # boxes after custom NMS: 3
  # boxes after tv NMS:      3
Image: banana6.jpg
  top 5 scores for banana6.jpg: [0.2131563127040863, 0.20654091238975525, 0.20451699197292328, 0.18954293429851532, 0.18721458315849304]
  # boxes before NMS: 50
  # boxes after custom NMS: 3
  # boxes after tv NMS:      3

In summary, my SSD-lite banana detector successfully learns to localize a single “banana” class on the D2L dataset. The training loss curves show clear convergence, and on both validation images and my own banana photos, the model usually produces reasonable, although slightly loose, bounding boxes around the fruit. When the scene looks similar to the training data (a single, reasonably large banana on a clean background), the detector typically draws one box that covers most of the banana, which shows that the backbone, anchors, and loss design are sufficient for this simple setting.

However, I also observe a number of clear failure cases that highlight the limitations of my approach. When the banana is very small in the image or placed near the edges, the detector often either misses it entirely or fires on a nearby background region instead. In more cluttered scenes (e.g., banana on a messy desk or next to other yellow objects), the model sometimes places boxes on textures or colors that resemble the banana rather than on the banana itself, suggesting that the learned features are relatively shallow and sensitive to color rather than higher-level shape. On some of my custom images, where the banana is partially occluded or at an unusual orientation, the predicted box can drift and only cover part of the fruit, or become oversized and include a large chunk of background.

These behaviors are consistent with the design choices I made: a lightweight SSD-lite architecture with a tiny backbone, a single low-resolution feature map, coarse anchors, and training from scratch on a small, single-class dataset. A heavier detector such as Faster R-CNN with a pre-trained backbone and multi-scale features would generally handle small objects, clutter, and pose variation more robustly, but at a higher computational cost. In that sense, the inaccuracies I observe are not just random errors. They are directly tied to the architectural trade-offs I made to keep the model simple and efficient.
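To make the anchor design concrete, here is a minimal sketch of how coarse anchors can be tiled over a single low-resolution feature map. The sizes and aspect ratios below are illustrative placeholders, not the exact values used in my model, and the boxes are in normalized (cx, cy, w, h) format:

```python
import itertools

def make_anchors(fmap_h, fmap_w, sizes=(0.4, 0.6), ratios=(1.0, 2.0, 0.5)):
    """Tile anchors over an fmap_h x fmap_w feature map.

    Each cell center gets one anchor per (size, ratio) pair, returned as
    (cx, cy, w, h) tuples normalized to [0, 1]. Sizes/ratios are illustrative.
    """
    anchors = []
    for i, j in itertools.product(range(fmap_h), range(fmap_w)):
        cy = (i + 0.5) / fmap_h  # cell center, normalized
        cx = (j + 0.5) / fmap_w
        for s, r in itertools.product(sizes, ratios):
            w = s * r ** 0.5  # wider anchors for larger aspect ratios
            h = s / r ** 0.5
            anchors.append((cx, cy, w, h))
    return anchors

anchors = make_anchors(4, 4)
print(len(anchors))  # 4 * 4 cells * (2 sizes * 3 ratios) = 96
```

With only one coarse grid like this, a small or edge-hugging banana may simply have no anchor that overlaps it well, which is consistent with the failure cases above.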

Part 2: Non-Maximum Suppression (NMS)¶

Discussion of Results (Images in Part 1)¶

For this project I implemented a standard greedy NMS routine that sorts boxes by score, repeatedly keeps the highest-scoring box, and removes all remaining boxes whose IoU with it exceeds a chosen threshold. I then compared it directly with torchvision.ops.nms by running both on the same decoded boxes and scores from my SSD-lite detector and visualizing the results. In practice, the two methods produced essentially identical outputs on all test and personal images: the same 1–3 boxes were kept after suppression, and any differences were minor (usually just which of two nearly overlapping, similar-score boxes survived). This confirms that my implementation matches the behavior of the PyTorch reference up to tie-breaking among near-equal scores.
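The greedy routine described above can be sketched in a few lines of plain Python. The box coordinates in the usage example are made up, in (x1, y1, x2, y2) format; my actual implementation operates on the decoded prediction tensors:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def greedy_nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it above
    iou_thresh, and repeat until no candidates remain."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))  # -> [0, 2]: box 1 overlaps box 0 and is suppressed
```

torchvision.ops.nms implements the same greedy logic over tensors, which is why the two versions agree apart from occasional tie-breaking.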

Conceptually, NMS’s purpose is to clean up the dense set of overlapping anchor predictions and turn them into a small set of final detections (ideally one box per object). It does not make the model more accurate by itself. It only removes redundant boxes. The main limitations I observed are:

(1) NMS blindly trusts the model’s scores, so if the detector assigns higher scores to a wrong region (e.g., my hand or the cabinet instead of the banana), NMS will keep that wrong box.

(2) It cannot fix localization errors. If all high-scoring boxes are slightly off, the final box will also be off.

(3) It depends on a hand-chosen IoU threshold, which can either suppress too many boxes or leave duplicates if set poorly.
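To make limitation (3) concrete, here is a tiny made-up illustration: two boxes whose IoU is exactly 0.6 would be merged into one detection under a 0.5 threshold, but both would survive as duplicates under a 0.7 threshold.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

a = (0, 0, 10, 10)      # hypothetical detection
b = (0, 2.5, 10, 12.5)  # near-duplicate, shifted down
overlap = iou(a, b)     # 75 / 125 = 0.6
for thresh in (0.5, 0.7):
    kept = 1 if overlap > thresh else 2
    print(f"IoU threshold {thresh}: {kept} box(es) kept")
```

The "right" threshold therefore depends on how tightly objects in the dataset overlap, which is why I had to tune it by hand.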

Part 3: Human–Object Interaction (HOI) Analysis using VLMs¶

For the HOI part, I used GPT as a vision–language model on three images: (1) a mounted police officer on a horse, (2) a motorcycle racer, and (3) a person reading a book. I prompted the model with: “List all human–object interactions in this image using the format (e.g., ride bicycle, hold phone).” GPT correctly produced interactions such as “ride horse / hold reins / wear helmet” for the first image and “read book / hold book” for the third image. These cases are relatively easy: the human and object are clear, and the interaction matches very common categories seen in training.

The main failure case came from the motorcycle image. GPT sometimes hesitated between “ride motorcycle” and more generic descriptions like “lean over motorcycle” or “work on motorcycle”, likely because the racing posture and motion blur make the interaction visually ambiguous. I tried prompt refinements such as “Describe the main action the person is performing with the object (e.g., ride motorcycle, not stand near motorcycle).” This reduced but did not completely remove the ambiguity. Overall, GPT performs well on clear, everyday HOIs but can still misclassify actions when poses are extreme or the visual evidence is ambiguous, and prompt engineering only partially fixes these edge cases.